Abstract

As one of the simplest probabilistic topic modeling techniques, latent Dirichlet allocation (LDA) has found many important applications in text mining, computer vision and computational biology. Recent training algorithms for LDA can be interpreted within a unified message passing framework. However, message passing requires storing previous messages, and the memory this takes grows linearly with the number of documents or the number of topics. High memory usage is therefore often a major problem for topic modeling of massive corpora containing a large number of topics. To reduce the space complexity, we propose a novel algorithm for training LDA that does not store previous messages: tiny belief propagation (TBP). The basic idea of TBP is to relate the message passing algorithms to the non-negative matrix factorization (NMF) algorithms, which absorb the message updating into the message passing process and thus avoid storing previous messages. Experimental results on four large data sets confirm that TBP performs comparably to or even better than current state-of-the-art training algorithms for LDA, but with much lower memory consumption. TBP can do topic modeling when massive corpora cannot fit in the computer memory, for example, extracting thematic topics from the 7GB PUBMED corpus on a common desktop computer with 2GB memory.

Keywords: Topic models, latent Dirichlet allocation, tiny belief propagation, non-negative matrix factorization, memory usage.



1. Introduction 



Latent Dirichlet allocation (LDA) (Blei et al., 2003) is a three-layer hierarchical Bayesian model for probabilistic topic modeling, computer vision and computational biology (Blei, 2012). A collection of documents can be represented as a document-word co-occurrence matrix, where each element is the count of a word in a specific document. Modeling each document as a mixture of topics and each topic as a mixture of vocabulary words, LDA assigns thematic labels to explain non-zero elements in the document-word matrix, segmenting observed words into several thematic groups called topics. From the joint probability of latent labels and observed words, existing training algorithms of LDA approximately infer the posterior probability of topic labels given observed words, and estimate multinomial parameters for document-specific topic proportions and topic distributions of vocabulary words. The time and space complexity of these training algorithms depends on the number of non-zero (NNZ) elements in the matrix.

Probabilistic topic modeling for massive corpora has attracted intense interest recently.
This research line is motivated by increasingly common massive data sets, such as online 
distributed texts, images and videos. Extracting and analyzing the large number of topics 
from these massive data sets brings new challenges to current topic modeling algorithms, 
particularly in computation time and memory requirement. In this paper, we focus on 
reducing the memory usage of topic modeling for massive corpora, because the memory 
limitation prohibits running existing topic modeling algorithms. For example, when the 
document-word matrix has NNZ = 5 x 10^, existing training algorithms of LDA often 
require allocating more than 12GB of memory, including space for data and parameters.
Such a topic modeling task cannot be done on a common desktop computer with 2GB 
memory even if we can tolerate the slow speed of topic modeling. 

Because computing the exact posterior of LDA is intractable, we must adopt approximate inference methods for training LDA. Modern approximate posterior inference algorithms for LDA fall broadly into three categories: variational Bayes (VB) (Blei et al., 2003), collapsed Gibbs sampling (GS) (Griffiths and Steyvers, 2004), and loopy belief propagation (BP) (Zeng et al., 2011). We may interpret these methods within a unified message passing framework (Bishop, 2006), which infers the approximate marginal posterior distribution of the topic label for each word, called the message. According to the expectation-maximization (EM) algorithm (Dempster et al., 1977), the locally inferred messages are used to estimate the best multinomial parameters in LDA based on the maximum-likelihood (ML) criterion.

VB is a variational message passing algorithm (Winn and Bishop, 2005), which infers the message from a factorizable variational distribution chosen to be close in Kullback-Leibler (KL) divergence to the joint distribution. The gap between the variational and true joint distributions causes VB to use computationally expensive digamma functions, introducing bias and slowness in the message updating and passing process (Asuncion et al., 2009; Zeng et al., 2011). GS is based on a Markov chain Monte Carlo (MCMC) sampling process whose stationary distribution is the desired joint distribution. GS usually updates its message using the topic labels sampled from previous messages, which does not keep all uncertainties of previous messages. In contrast, BP directly updates and passes the entire messages without sampling, and thus achieves a much higher topic modeling accuracy. So far, BP is very competitive in both speed and accuracy for topic modeling (Zeng et al., 2011). Similar BP ideas have also been discussed as the zero-order approximation of the collapsed VB (CVB0) algorithm within the mean-field framework (Asuncion et al., 2009; Asuncion, 2010).

However, message passing techniques often require storing previous messages for updating and passing, which leads to high memory usage that increases linearly with the number of documents or the number of topics. To reduce memory usage, we propose a novel algorithm for training LDA: tiny belief propagation (TBP). The basic idea of TBP is inspired by the multiplicative update rules of non-negative matrix factorization (NMF) (Lee and Seung, 2001), which absorb the message updating into the passing process without storing previous messages. Extensive experiments demonstrate that TBP uses significantly less memory for topic modeling of massive data sets, yet achieves comparable or even better topic modeling accuracy than VB, GS and BP. Moreover, the speed of TBP is very close to that of BP, which is currently the fastest batch learning algorithm for topic modeling (Zeng et al., 2011). We also extend the proposed TBP using the block optimization framework (Yu et al., 2010) to handle the case when data cannot fit in computer memory. For example, we extend TBP to extract 10 topics from the 7GB PUBMED biomedical corpus using a desktop computer with 2GB memory.

There are two straightforward machine learning strategies for processing large-scale data sets: online and parallel learning schemes. On the one hand, online topic modeling algorithms such as online VB (OVB) (Hoffman et al., 2010) read massive corpora as a data stream composed of multiple smaller mini-batches. Loading each mini-batch into memory, OVB optimizes LDA within the online stochastic optimization framework, theoretically converging to the batch VB's objective function. But OVB still needs to store messages for each mini-batch. When the mini-batch is large, the space complexity of OVB is still higher than that of the batch training algorithm TBP. In addition, the best online topic modeling performance depends highly on several heuristic parameters, including the mini-batch size. On the other hand, parallel topic modeling algorithms such as parallel GS (PGS) (Newman et al., 2009) use expensive parallel architectures with more physical memory. Indeed, PGS does not reduce the space complexity of training LDA; it distributes massive corpora across P distributed computing units, so that each unit requires only 1/P of the memory used by GS. By contrast, the proposed TBP can reduce the space complexity of batch training of LDA on a common desktop computer. Notice that we may also develop much more efficient online and parallel topic modeling algorithms based on TBP for a significant speedup.

The rest of the paper is organized as follows. Section 2 compares VB, GS and BP for message passing, and analyzes their space complexity for training LDA. Section 3 proposes the TBP algorithm to reduce the space complexity of BP, and discusses TBP's relation with the multiplicative update rules of NMF. Section 4 shows extensive experiments on four real-world corpora. Finally, Section 5 draws conclusions and envisions future work.



2. The Message Passing Algorithms for Training LDA 

LDA allocates a set of semantic topic labels, $\mathbf{z} = \{z^k_{w,d}\}$, to explain non-zero elements in the document-word co-occurrence matrix $\mathbf{x}_{W \times D} = \{x_{w,d}\}$, where $1 \le w \le W$ denotes the word index in the vocabulary, $1 \le d \le D$ denotes the document index in the corpus, and $1 \le k \le K$ denotes the topic index. Usually, the number of topics $K$ is provided by users. The topic label satisfies $z^k_{w,d} \in \{0, 1\}$, $\sum_{k=1}^{K} z^k_{w,d} = 1$. After inferring the topic labeling configuration over the document-word matrix, LDA estimates two matrices of multinomial parameters: topic distributions over the fixed vocabulary $\phi_{W \times K} = \{\phi_{\cdot,k}\}$ and document-specific topic proportions $\theta_{K \times D} = \{\theta_{\cdot,d}\}$, where $\theta_{\cdot,d}$ is a $K$-tuple vector and $\phi_{\cdot,k}$ is a $W$-tuple vector, satisfying $\sum_k \theta_{k,d} = 1$ and $\sum_w \phi_{w,k} = 1$. From a document-specific proportion $\theta_{\cdot,d}$, LDA independently generates a topic label $z^k_{\cdot,d} = 1$, which further combines $\phi_{\cdot,k}$ to generate a word index $w$, forming the total number of observed word counts $x_{w,d}$. Both multinomial vectors $\theta_{\cdot,d}$ and $\phi_{\cdot,k}$ are generated by two Dirichlet distributions with hyperparameters $\alpha$ and $\beta$. For simplicity, we consider the smoothed LDA with fixed symmetric hyperparameters provided by users (Griffiths and Steyvers, 2004). To illustrate the generative process, we refer the readers to the original three-layer graphical representation for LDA (Blei et al., 2003) and the two-layer factor graph for the collapsed LDA (Zeng et al., 2011).

Figure 1: Message passing for training LDA: (A) collapsed Gibbs sampling (GS), (B) loopy belief propagation (BP), and (C) variational Bayes (VB).

Recently, there have been three types of message passing algorithms for training LDA: GS, BP and VB. The space complexity of these algorithms is

$$\text{Total memory usage} = \text{data memory} + \text{message memory} + \text{parameter memory}, \tag{1}$$

where the data memory stores the input document-word matrix $\mathbf{x}_{W \times D}$, the message memory is allocated to store previous messages during passing, and the parameter memory stores the two output parameter matrices $\phi_{W \times K}$ and $\theta_{K \times D}$. Because the input and output matrices of these algorithms are the same, we focus on comparing the message memory consumption among these message passing algorithms.

2.1 Collapsed Gibbs Sampling (GS) 

After integrating out the multinomial parameters $\{\theta, \phi\}$, LDA becomes the collapsed LDA in the collapsed hidden variable space $\{\mathbf{z}, \alpha, \beta\}$. GS (Griffiths and Steyvers, 2004) is a Markov chain Monte Carlo (MCMC) sampling technique to infer the marginal distribution or message, $\mu_{w,d,n}(k) = p(z^k_{w,d,n} = 1)$, where $1 \le n \le x_{w,d}$ is the word token index. The message update equation is

$$\mu_{w,d,n}(k) \propto \frac{z^k_{\cdot,d,-n} + \alpha}{\sum_k [z^k_{\cdot,d,-n} + \alpha]} \times \frac{z^k_{w,\cdot,-n} + \beta}{\sum_w [z^k_{w,\cdot,-n} + \beta]}, \tag{2}$$

where $z^k_{\cdot,d,-n} = \sum_w z^k_{w,d,-n}$, $z^k_{w,\cdot,-n} = \sum_d z^k_{w,d,-n}$, and the notation $-n$ denotes excluding the current topic label $z^k_{w,d,n}$. After normalizing the message so that $\sum_k \mu_{w,d,n}(k) = 1$, GS draws a random number $u \sim \text{Uniform}[0, 1]$ and checks which topic segment is hit, as shown in Fig. 1A, where $K = 4$ for example. If the topic index $k = 3$ is hit, then we assign $z^3_{w,d,n} = 1$. The sampled topic label is used immediately to estimate the message for the next word token. If we view the sampled topic labels as particles, GS can be interpreted as a special case of non-parametric belief propagation (Sudderth et al., 2003), in which only particles rather than complete messages are updated and passed at each iteration. Eq. (2) sweeps all word tokens for $1 \le t \le T$ training iterations until the convergence criterion is satisfied. To exclude the current topic label $z^k_{w,d,n}$ in Eq. (2), we need to store all topic labels, $z^k_{w,d,n} = 1, \forall w, d, n$, in memory for message passing. On a common 32-bit desktop computer, GS generally uses the integer type (4 bytes) for each topic label, so the approximate message memory in bytes can be estimated by

$$\text{GS} = 4 \times \sum_{w,d} x_{w,d}, \tag{3}$$

where $\sum_{w,d} x_{w,d}$ is the total number of word tokens in the document-word matrix. For example, the 7GB PUBMED corpus has 737,869,083 word tokens, occupying around 2.75GB of message memory according to Eq. (3).
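The sampling step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names `init_state` and `gibbs_sweep`, the count arrays `n_dk`, `n_wk`, `n_k`, and the toy corpus layout are all hypothetical. The per-document denominator of Eq. (2) is constant in $k$, so the sketch drops it and lets the sampling step handle normalization.

```python
import random


def init_state(tokens, W, D, K, rng):
    """Assign a random topic to every token and build the count tables."""
    z = [rng.randrange(K) for _ in tokens]
    n_dk = [[0] * K for _ in range(D)]   # tokens of document d labeled k
    n_wk = [[0] * K for _ in range(W)]   # tokens of word w labeled k
    n_k = [0] * K                        # all tokens labeled k
    for (w, d), k in zip(tokens, z):
        n_dk[d][k] += 1
        n_wk[w][k] += 1
        n_k[k] += 1
    return z, n_dk, n_wk, n_k


def gibbs_sweep(tokens, z, n_dk, n_wk, n_k, alpha, beta, W, rng):
    """One collapsed Gibbs sweep over all word tokens.

    z holds one label per token; this list is the message memory that
    GS has to keep between iterations.
    """
    K = len(n_k)
    for i, (w, d) in enumerate(tokens):
        k_old = z[i]
        # Exclude the current label (the "-n" notation).
        n_dk[d][k_old] -= 1
        n_wk[w][k_old] -= 1
        n_k[k_old] -= 1
        # Unnormalized message over the K topics.
        mu = [(n_dk[d][k] + alpha) * (n_wk[w][k] + beta) / (n_k[k] + W * beta)
              for k in range(K)]
        # Draw u and find which topic segment it hits.
        u = rng.random() * sum(mu)
        acc, k_new = 0.0, K - 1
        for k in range(K):
            acc += mu[k]
            if u < acc:
                k_new = k
                break
        z[i] = k_new
        n_dk[d][k_new] += 1
        n_wk[w][k_new] += 1
        n_k[k_new] += 1
    return z
```

Only the sampled label survives each step, which is why the stored state is one integer per token rather than a full $K$-tuple.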

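Based on the inferred topic configuration $z^k_{w,d,n}$ over word tokens, the multinomial parameters can be estimated as follows:

$$\phi_{w,k} = \frac{z^k_{w,\cdot} + \beta}{\sum_w [z^k_{w,\cdot} + \beta]}, \tag{4}$$

$$\theta_{k,d} = \frac{z^k_{\cdot,d} + \alpha}{\sum_k [z^k_{\cdot,d} + \alpha]}. \tag{5}$$

These equations look similar to Eq. (2), except that they include the current topic label $z^k_{w,d,n}$ in both numerator and denominator.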

2.2 Loopy Belief Propagation (BP) 

Similar to GS, BP (Zeng et al., 2011) operates in the collapsed hidden variable space of LDA, called the collapsed LDA. The basic idea is to integrate out the multinomial parameters $\{\theta, \phi\}$ and infer the marginal posterior probability in the collapsed space $\{\mathbf{z}, \alpha, \beta\}$. The collapsed LDA can be represented by a factor graph, which facilitates the BP algorithm for approximate inference and parameter estimation. Unlike GS, BP infers messages, $\mu_{w,d}(k) = p(z^k_{w,d} = 1)$, without sampling in order to keep all uncertainties of messages. The message update equation is

$$\mu_{w,d}(k) \propto \frac{\mu_{-w,d}(k) + \alpha}{\sum_k [\mu_{-w,d}(k) + \alpha]} \times \frac{\mu_{w,-d}(k) + \beta}{\sum_w [\mu_{w,-d}(k) + \beta]}, \tag{6}$$

where $\mu_{-w,d}(k) = \sum_{-w} x_{-w,d}\, \mu_{-w,d}(k)$ and $\mu_{w,-d}(k) = \sum_{-d} x_{w,-d}\, \mu_{w,-d}(k)$. The notations $-w$ and $-d$ denote all word indices except $w$ and all document indices except $d$. After normalizing so that $\sum_k \mu_{w,d}(k) = 1$, BP updates other messages iteratively. Fig. 1B illustrates the message passing in BP when $K = 4$, slightly different from GS in Fig. 1A. Eq. (6) differs from Eq. (2) in two aspects. First, BP infers messages based on word indices rather than word tokens. Second, BP updates and passes complete messages without sampling. In this sense, BP can be viewed as a soft version of GS. These differences give Eq. (6) two advantages over Eq. (2). First, it keeps all uncertainties of messages for high topic modeling accuracy. Second, it scans a total of NNZ word indices for message passing, which is significantly fewer than the total number of word tokens $\sum_{w,d} x_{w,d}$ in $\mathbf{x}$. So, BP is often faster than GS because it scans significantly fewer elements ($\text{NNZ} \ll \sum_{w,d} x_{w,d}$) at each training iteration (Zeng et al., 2011). Eq. (6) scans the NNZ elements in the document-word matrix for $1 \le t \le T$ training iterations until the convergence criterion is satisfied.
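To make the difference from GS concrete, the following Python sketch performs one asynchronous BP sweep over the non-zero entries. It is an illustration under assumptions rather than the authors' code: the names `init_messages` and `bp_sweep`, the running sums `m_dk`, `m_wk`, `m_k`, and the `(w, d, x)` entry format are hypothetical. As in Eq. (6), the current message is subtracted before the update, and the factor that is constant in $k$ is left to the normalization.

```python
import random


def init_messages(entries, W, D, K, rng):
    """Random normalized K-tuple message for every non-zero entry (w, d, x)."""
    mu = []
    m_dk = [[0.0] * K for _ in range(D)]   # sum over w of x * mu
    m_wk = [[0.0] * K for _ in range(W)]   # sum over d of x * mu
    m_k = [0.0] * K                        # sum over w and d of x * mu
    for w, d, x in entries:
        raw = [rng.random() + 1e-3 for _ in range(K)]
        total = sum(raw)
        msg = [v / total for v in raw]
        mu.append(msg)
        for k in range(K):
            m_dk[d][k] += x * msg[k]
            m_wk[w][k] += x * msg[k]
            m_k[k] += x * msg[k]
    return mu, m_dk, m_wk, m_k


def bp_sweep(entries, mu, m_dk, m_wk, m_k, alpha, beta, W):
    """One asynchronous sweep; mu is the stored message memory (K values per entry)."""
    K = len(m_k)
    for i, (w, d, x) in enumerate(entries):
        old = mu[i]
        # Exclude the current message (the "-w" and "-d" notation).
        new = [(m_dk[d][k] - x * old[k] + alpha)
               * (m_wk[w][k] - x * old[k] + beta)
               / (m_k[k] - x * old[k] + W * beta)
               for k in range(K)]
        total = sum(new)
        new = [v / total for v in new]
        # Push the change into the running sums right away.
        for k in range(K):
            delta = x * (new[k] - old[k])
            m_dk[d][k] += delta
            m_wk[w][k] += delta
            m_k[k] += delta
        mu[i] = new
    return mu
```

The list `mu` holds $K$ floating-point values per non-zero entry, and it must persist across sweeps because each update subtracts the previous message.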

However, BP has a higher space complexity than GS. Because BP excludes the current message $\mu_{w,d}(k)$ in the message update (6), it requires storing all $K$-tuple messages. On the widely used 32-bit desktop computer, we generally use the double type (8 bytes) to store all messages, with the memory occupancy in bytes

$$\text{BP} = 8 \times K \times \text{NNZ}, \tag{7}$$

which increases linearly with the number of topics $K$. For example, the 7GB PUBMED corpus has NNZ = 483,450,157. When $K = 10$, BP needs around 36GB for message passing. Notice that when $K$ is large, Eq. (7) is significantly higher than Eq. (3).
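The two message-memory estimates can be checked with a few lines of arithmetic. The sketch below plugs in the PUBMED figures quoted in the text; the helper names are hypothetical, and it assumes that "GB" in the text means $1024^3$ bytes, which is the reading that reproduces the quoted values.

```python
GIB = 1024 ** 3  # bytes per GB, assuming the binary convention


def gs_message_bytes(num_tokens, bytes_per_label=4):
    """Eq. (3): one integer topic label per word token."""
    return bytes_per_label * num_tokens


def bp_message_bytes(nnz, num_topics, bytes_per_value=8):
    """Eq. (7): one K-tuple of doubles per non-zero matrix element."""
    return bytes_per_value * num_topics * nnz


pubmed_tokens = 737_869_083
pubmed_nnz = 483_450_157

gs_gib = gs_message_bytes(pubmed_tokens) / GIB
bp_gib = bp_message_bytes(pubmed_nnz, num_topics=10) / GIB
print(f"GS: {gs_gib:.2f} GB, BP (K=10): {bp_gib:.2f} GB")
```

Because Eq. (7) scales with $K$ while Eq. (3) does not, the gap widens as more topics are requested.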

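Based on the normalized messages, the multinomial parameters can be estimated by

$$\phi_{w,k} = \frac{\mu_{w,\cdot}(k) + \beta}{\sum_w [\mu_{w,\cdot}(k) + \beta]}, \tag{8}$$

$$\theta_{k,d} = \frac{\mu_{\cdot,d}(k) + \alpha}{\sum_k [\mu_{\cdot,d}(k) + \alpha]}, \tag{9}$$

where $\mu_{w,\cdot}(k) = \sum_d x_{w,d}\, \mu_{w,d}(k)$ and $\mu_{\cdot,d}(k) = \sum_w x_{w,d}\, \mu_{w,d}(k)$. These equations look similar to Eq. (6), except that they include the current message $\mu_{w,d}(k)$ in both numerator and denominator.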



2.3 Variational Bayes (VB) 

Unlike BP in the collapsed space, VB (Blei et al., 2003; Winn and Bishop, 2005) passes variational messages, $\tilde{\mu}_{w,d}(k) = \tilde{p}(z^k_{w,d} = 1)$, derived from the approximate variational distribution $\tilde{p}$ to the true joint distribution $p$ by minimizing the KL divergence, $KL(\tilde{p}\,\|\,p)$. The variational message update equation is

$$\tilde{\mu}_{w,d}(k) \propto \frac{\exp[\Psi(\tilde{\mu}_{\cdot,d}(k) + \alpha)]}{\exp[\Psi(\sum_k [\tilde{\mu}_{\cdot,d}(k) + \alpha])]} \times \frac{\tilde{\mu}_{w,\cdot}(k) + \beta}{\sum_w [\tilde{\mu}_{w,\cdot}(k) + \beta]}, \tag{10}$$

where $\tilde{\mu}_{\cdot,d}(k) = \sum_w x_{w,d}\, \tilde{\mu}_{w,d}(k)$, $\tilde{\mu}_{w,\cdot}(k) = \sum_d x_{w,d}\, \tilde{\mu}_{w,d}(k)$, and $\exp$ and $\Psi$ are the exponential and digamma functions, respectively. After normalizing the variational message so that $\sum_k \tilde{\mu}_{w,d}(k) = 1$, VB passes this message to update other messages. There are two major differences between Eq. (10) and Eq. (6). First, Eq. (10) involves computationally expensive digamma functions. Second, it includes the current variational message $\tilde{\mu}_{w,d}$ in the update equation. The digamma function significantly slows down VB, and also introduces bias in message passing (Asuncion et al., 2009; Zeng et al., 2011). Fig. 1C shows the variational message passing in VB, where the dashed line illustrates that the variational message is derived from the variational distribution. Because VB also stores the variational messages for updating and passing, its space complexity is the same as that of BP in Eq. (7). Based on the normalized variational messages, VB estimates the multinomial parameters as



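$$\phi_{w,k} = \frac{\tilde{\mu}_{w,\cdot}(k) + \beta}{\sum_w [\tilde{\mu}_{w,\cdot}(k) + \beta]}, \tag{11}$$

$$\theta_{k,d} = \frac{\tilde{\mu}_{\cdot,d}(k) + \alpha}{\sum_k [\tilde{\mu}_{\cdot,d}(k) + \alpha]}. \tag{12}$$

These equations are almost the same as Eqs. (8) and (9) but use variational messages.

2.4 Synchronous and Asynchronous Message Passing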

Message passing algorithms for LDA first randomly initialize messages, and then pass messages according to one of two schedules: the synchronous and the asynchronous update schedules (Tappen and Freeman, 2003). The synchronous message passing schedule uses all messages at training iteration $t - 1$ to update current messages at training iteration $t$, while the asynchronous schedule immediately uses the updated messages to update the remaining messages within the same training iteration $t$. Empirical results demonstrate that the asynchronous schedule is slightly more efficient than the synchronous schedule (Zeng et al., 2011) for topic modeling. However, the synchronous schedule is much easier to extend for parallel computation.

GS is naturally an asynchronous message passing algorithm. The sampled topic label immediately influences the topic sampling process at the next word token. Both synchronous and asynchronous schedules of BP work equally well in terms of topic modeling accuracy, but the asynchronous schedule converges slightly faster than the synchronous one (Elidan et al., 2006). VB is a synchronous variational message passing algorithm, updating messages at iteration $t$ using messages at iteration $t - 1$.



3. Tiny Belief Propagation 

In this section, we propose TBP to save the message memory and data memory usage of BP in Section 2.2. Generally, the parameter memory of BP takes relatively little space when the number of topics $K$ is small. For example, for the 7GB PUBMED data set ($D$ = 8,200,000 and $W$ = 141,043), when $K = 10$, the parameter $\theta_{K \times D}$ occupies around 0.6GB memory, while the parameter $\phi_{W \times K}$ occupies around 0.01GB memory. For simplicity, we assume that the parameter memory is enough for topic modeling.



3.1 Message Memory 

The algorithmic contribution of TBP is to reduce the message memory of BP to almost zero during the message passing process. Combining Eqs. (6), (8) and (9) yields the approximate message update equation,

$$\mu_{w,d}(k) \propto \phi_{w,k} \times \theta_{k,d}, \tag{13}$$

where the current message $\mu_{w,d}(k)$ is added in both numerator and denominator in Eq. (6). Notice that such an approximation does not distort the message update very much because the message $\mu_{w,d}(k)$ is significantly smaller than the aggregate of other messages in both numerator and denominator. Eq. (13) has the following intuitive explanation. If the $w$th word has a higher likelihood in topic $k$ and topic $k$ has a larger proportion in the $d$th document, then topic $k$ has a higher probability of being assigned to the element $x_{w,d}$, i.e., $z^k_{w,d} = 1$. The normalized message can be written as the matrix operation,

$$\mu_{w,d}(k) = \frac{\phi_{w,k}\,\theta_{k,d}}{(\phi\theta)_{w,d}}, \tag{14}$$

where $(\phi\theta)_{w,d}$ is the element at $\{w, d\}$ after the matrix multiplication $\phi\theta$. Within the probabilistic framework, LDA generates the word token at index $\{w, d\}$ using the likelihood $(\phi\theta)_{w,d}$, which satisfies $\sum_w (\phi\theta)_{w,d} = 1$, so that $\sum_{w,d} (\phi\theta)_{w,d} = D$ is a constant. Replacing the normalized messages by Eq. (14), we can rewrite Eqs. (8) and (9) as

$$\phi_{w,k} = \frac{\sum_d x_{w,d}\,[\phi_{w,k}\theta_{k,d} / (\phi\theta)_{w,d}] + \beta}{\sum_{w,d} x_{w,d}\,[\phi_{w,k}\theta_{k,d} / (\phi\theta)_{w,d}] + W\beta}, \tag{15}$$

$$\theta_{k,d} = \frac{\sum_w x_{w,d}\,[\phi_{w,k}\theta_{k,d} / (\phi\theta)_{w,d}] + \alpha}{\sum_w x_{w,d} + K\alpha}, \tag{16}$$

where the denominators play normalization roles to constrain $\sum_k \theta_{k,d} = 1, \theta_{k,d} > 0$ and $\sum_w \phi_{w,k} = 1, \phi_{w,k} > 0$. We absorb the message update equation into the parameter estimation in Eqs. (15) and (16), so that we do not need to store the previous messages during the message passing process. We refer to this matrix update algorithm as TBP.

If we discard the hyperparameters $\alpha$ and $\beta$ in Eqs. (15) and (16), we find that these matrix update equations look similar to the following multiplicative update rules in non-negative matrix factorization (NMF) (Lee and Seung, 2001),

$$\phi_{w,k} \leftarrow \frac{\sum_d x_{w,d}\,[\phi_{w,k}\theta_{k,d} / (\phi\theta)_{w,d}]}{\sum_d \theta_{k,d}}, \tag{17}$$

$$\theta_{k,d} \leftarrow \frac{\sum_w x_{w,d}\,[\phi_{w,k}\theta_{k,d} / (\phi\theta)_{w,d}]}{\sum_w \phi_{w,k}}, \tag{18}$$

where the objective of NMF is to minimize the following divergence,

$$D(\mathbf{x}\,\|\,\phi\theta) = \sum_{w,d} \left( x_{w,d} \log \frac{x_{w,d}}{(\phi\theta)_{w,d}} - x_{w,d} + (\phi\theta)_{w,d} \right), \tag{19}$$

under the constraints $\phi_{w,k} \ge 0$ and $\theta_{k,d} \ge 0$. First, Eqs. (15) and (16) differ from Eqs. (17) and (18) in the denominators, just because LDA additionally constrains the sum of multinomial parameters to be one. Second, as far as LDA is concerned, because $\sum_{w,d} x_{w,d}$ and $\sum_{w,d} (\phi\theta)_{w,d}$ are constants, Eq. (19) is proportional to the standard Kullback-Leibler (KL) divergence,

$$KL(\mathbf{x}\,\|\,\phi\theta) \propto \sum_{w,d} x_{w,d} \log \frac{x_{w,d}}{(\phi\theta)_{w,d}} \propto \sum_{w,d} -x_{w,d} \log (\phi\theta)_{w,d}. \tag{20}$$
w,d 
In conclusion, if we discard the hyperparameters in Eqs. (15) and (16), the proposed TBP algorithm becomes a special NMF algorithm:

$$\min \left( \sum_{w,d} -x_{w,d} \log (\phi\theta)_{w,d} \right), \quad \forall x_{w,d} \neq 0, \tag{21}$$

$$\phi_{w,k} \ge 0, \quad \sum_w \phi_{w,k} = 1, \tag{22}$$

$$\theta_{k,d} \ge 0, \quad \sum_k \theta_{k,d} = 1, \tag{23}$$

where TBP focuses only on approximating the non-zero elements $x_{w,d} \neq 0$ by $\phi\theta$ in terms of the KL divergence. Notice that the hyperparameters play smoothing roles in avoiding zeros in the factorized matrices in Eqs. (15) and (16), where zeros are major reasons for worse performance in predicting unseen words in the test set (Blei et al., 2003).

Conventionally, different training algorithms for LDA can be fairly compared by the perplexity metric (Blei et al., 2003; Asuncion et al., 2009; Hoffman et al., 2010),

$$\text{Perplexity} = \exp \left\{ -\frac{\sum_{w,d} x_{w,d} \log (\phi\theta)_{w,d}}{\sum_{w,d} x_{w,d}} \right\} \propto \sum_{w,d} -x_{w,d} \log (\phi\theta)_{w,d}, \tag{24}$$

which has been previously interpreted as the geometric mean of the likelihood in the probabilistic framework. Comparing (24) with (20), we find that the perplexity metric can also be interpreted as a KL divergence between the document-word matrix $\mathbf{x}$ and the product of the two factorized matrices $\phi\theta$. Because the TBP algorithm directly minimizes the KL divergence (21), it often has a much lower predictive perplexity on unseen test data than both the GS and VB algorithms in Section 2, giving better topic modeling accuracy. This theoretical analysis is also supported by extensive experiments in Section 4.
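As a small worked example of Eq. (24), the Python sketch below evaluates perplexity from normalized parameter matrices and a list of non-zero counts. The function name and the `(w, d, x)` entry format are assumptions made for illustration; the paper evaluates perplexity on held-out test data, whereas this sketch simply applies the formula to whatever counts it is given.

```python
import math


def perplexity(entries, phi, theta):
    """Evaluate Eq. (24) on non-zero entries (w, d, x).

    phi[w][k] and theta[k][d] are assumed to be normalized already, so
    that each column of phi and each column of theta sums to one.
    """
    num_topics = len(theta)
    log_likelihood = 0.0
    total_count = 0.0
    for w, d, x in entries:
        p = sum(phi[w][k] * theta[k][d] for k in range(num_topics))
        log_likelihood += x * math.log(p)
        total_count += x
    return math.exp(-log_likelihood / total_count)


# Toy check: with a uniform phi over 4 words, every token has
# likelihood 1/4 regardless of theta.
phi = [[0.25, 0.25] for _ in range(4)]
theta = [[0.5, 0.3], [0.5, 0.7]]
entries = [(0, 0, 2), (1, 0, 1), (2, 1, 3), (3, 1, 1)]
print(perplexity(entries, phi, theta))
```

The exponent is the negative average log-likelihood per token, so lowering the sum $\sum_{w,d} -x_{w,d} \log (\phi\theta)_{w,d}$ always lowers the perplexity.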

Indeed, LDA is a full Bayesian counterpart of probabilistic latent semantic analysis (PLSA) (Hofmann, 2001), which is equivalent to the NMF algorithm with the KL divergence (Gaussier and Goutte, 2005). Moreover, the inference objective functions of LDA and PLSA are very similar, and PLSA can be viewed as a maximum-a-posteriori (MAP) estimated LDA model (Girolami and Kaban, 2003). For example, two recent studies (Asuncion et al., 2009; Zeng et al., 2011) find that the CVB0 and the simplified BP algorithms for training LDA resemble the EM algorithm for training PLSA. Based on these previous works, it is a natural step to connect the NMF algorithms with the message passing algorithms for training LDA. More generally, we speculate that such intrinsic relations also exist between finite mixture models such as LDA and latent factor models such as NMF (Gershman and Blei, 2012). As an example, the NMF algorithm has recently been justified in theory to learn topic models such as LDA in polynomial time (Arora et al., 2012). Notice that TBP and other NMF algorithms do not need to store previous messages within the message passing framework in Section 2, and thus save a lot of memory.

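Based on (15) and (16), we implement two types of TBP algorithms: synchronous TBP (sTBP) and asynchronous TBP (aTBP), similar to the synchronous and asynchronous message passing algorithms in Section 2.4. Because the denominator of (16) is a constant, it does not influence the normalized message (14). So, we consider only the unnormalized $\theta$ during the matrix factorization. However, the denominator of (15) depends on $k$, so we use a $K$-tuple vector $\lambda_k$ to store the denominator, and use the unnormalized $\phi$ during the matrix factorization. The normalization can be easily performed by a simple division.

    input  : $\mathbf{x}, K, T, \alpha, \beta$.
    output : $\phi_{W \times K}, \theta_{K \times D}$.
     1  begin
     2    $\hat{\phi}_{W \times K} \leftarrow 0$, $\hat{\theta}_{K \times D} \leftarrow 0$, $\hat{\lambda}_{K} \leftarrow 0$;
     3    // random initialization
     4    for $w \leftarrow 1$ to $W$, $d \leftarrow 1$ to $D$, $x_{w,d} \neq 0$ do
     5      $k \leftarrow \text{rand}(K)$;
     6      $\hat{\phi}_{w,k} \leftarrow \hat{\phi}_{w,k} + x_{w,d}$;
     7      $\hat{\theta}_{k,d} \leftarrow \hat{\theta}_{k,d} + x_{w,d}$;
     8      $\hat{\lambda}_{k} \leftarrow \hat{\lambda}_{k} + x_{w,d}$;
     9    end
    10    // The sTBP algorithm
    11    for $t \leftarrow 1$ to $T$ do
    12      $\phi_{W \times K} \leftarrow \hat{\phi}_{W \times K}$, $\theta_{K \times D} \leftarrow \hat{\theta}_{K \times D}$, $\lambda_{K} \leftarrow \hat{\lambda}_{K}$;
    13      $\hat{\phi}_{W \times K} \leftarrow 0$, $\hat{\theta}_{K \times D} \leftarrow 0$, $\hat{\lambda}_{K} \leftarrow 0$;
    14      for $d \leftarrow 1$ to $D$, $w \leftarrow 1$ to $W$, $k \leftarrow 1$ to $K$, $x_{w,d} \neq 0$ do
    15        $\eta_k \leftarrow (\phi_{w,k} + \beta)(\theta_{k,d} + \alpha) / (\lambda_k + W\beta)$;
    16        $\eta_k \leftarrow \eta_k / \sum_k \eta_k$;
    17        $\hat{\phi}_{w,k} \leftarrow \hat{\phi}_{w,k} + x_{w,d}\,\eta_k$;
    18        $\hat{\theta}_{k,d} \leftarrow \hat{\theta}_{k,d} + x_{w,d}\,\eta_k$;
    19        $\hat{\lambda}_{k} \leftarrow \hat{\lambda}_{k} + x_{w,d}\,\eta_k$;
    20      end
    21    end
    22  end

Figure 2: The sTBP algorithm for LDA.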

Fig. 2 shows the synchronous TBP (sTBP) algorithm. We use three temporary matrices $\hat{\phi}, \hat{\theta}, \hat{\lambda}$ in Line 2 to store the numerators of (15) and (16) and the denominator of (15) for synchronization. From Line 4 to 9, we randomly initialize $\hat{\phi}, \hat{\theta}, \hat{\lambda}$ by rand($K$), which generates a random integer $k$, $1 \le k \le K$. At each training iteration $t$, $1 \le t \le T$, we copy the temporary matrices to $\phi, \theta, \lambda$ and clear the temporary matrices to zeros from Line 12 to 13. Then, for each non-zero element in the document-word matrix, we accumulate the numerators of (15) and (16) and the denominator of (15) by the $K$-tuple message $\eta_k$ in the temporary matrices $\hat{\phi}, \hat{\theta}, \hat{\lambda}$ from Line 15 to 19. In the synchronous schedule, the update of elements in the factorized matrices does not influence other elements within each iteration $t$.
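The synchronous schedule can be sketched in plain Python as follows. This is a hedged illustration of the steps described above, not the authors' implementation: the function names `stbp` and `normalize`, the `(w, d, x)` entry format, and the use of nested lists are all assumptions. The message `eta` is recomputed from the current parameters for every non-zero entry and is never stored between iterations.

```python
import random


def stbp(entries, W, D, K, T, alpha, beta, seed=0):
    """Synchronous TBP sketch; returns unnormalized phi, theta and lambda."""
    rng = random.Random(seed)
    phi_hat = [[0.0] * K for _ in range(W)]
    theta_hat = [[0.0] * D for _ in range(K)]
    lam_hat = [0.0] * K
    # Random initialization: each non-zero entry goes to one random topic.
    for w, d, x in entries:
        k = rng.randrange(K)
        phi_hat[w][k] += x
        theta_hat[k][d] += x
        lam_hat[k] += x
    for _ in range(T):
        # Freeze the accumulated values and start fresh accumulators.
        phi, theta, lam = phi_hat, theta_hat, lam_hat
        phi_hat = [[0.0] * K for _ in range(W)]
        theta_hat = [[0.0] * D for _ in range(K)]
        lam_hat = [0.0] * K
        for w, d, x in entries:
            eta = [(phi[w][k] + beta) * (theta[k][d] + alpha)
                   / (lam[k] + W * beta) for k in range(K)]
            total = sum(eta)
            for k in range(K):
                mass = x * eta[k] / total
                phi_hat[w][k] += mass
                theta_hat[k][d] += mass
                lam_hat[k] += mass
    return phi_hat, theta_hat, lam_hat


def normalize(phi, theta, lam, alpha, beta):
    """Turn the unnormalized matrices into the parameters of Eqs. (15)-(16)."""
    W, K, D = len(phi), len(lam), len(theta[0])
    phi_n = [[(phi[w][k] + beta) / (lam[k] + W * beta) for k in range(K)]
             for w in range(W)]
    doc_len = [sum(theta[k][d] for k in range(K)) for d in range(D)]
    theta_n = [[(theta[k][d] + alpha) / (doc_len[d] + K * alpha)
                for d in range(D)] for k in range(K)]
    return phi_n, theta_n
```

The price of synchronization is the second copy of the matrices held during each iteration; no per-entry state survives from one iteration to the next.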

Fig. 3 shows the asynchronous TBP (aTBP) algorithm. Unlike the sTBP algorithm, aTBP does not require the temporary matrices $\hat{\phi}, \hat{\theta}, \hat{\lambda}$. After the random initialization from Line 4 to 9, aTBP reduces the matrices $\phi, \theta, \lambda$ in a certain proportion from Line 13 to 15, which is compensated by the updated $K$-tuple message $\eta_k$ from Line 17 to 20. In the asynchronous schedule, a change of elements in the factorized matrices $\phi, \theta$ immediately influences the update of other elements. We therefore expect the asynchronous schedule to pass the influence of the updated elements in the matrices more efficiently than the synchronous schedule. The sTBP and aTBP algorithms iterate until the convergence condition is satisfied or the maximum iteration $T$ is reached.
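A corresponding sketch of the asynchronous schedule is given below, again as an illustration under assumptions rather than the authors' code. The name `atbp` and the helper sums `row`, `col` and `total` (the word, document and corpus totals of $x_{w,d}$) are hypothetical. Each entry first shrinks the affected matrix elements in proportion to its share of the counts and then adds its freshly computed message back in place.

```python
import random


def atbp(entries, W, D, K, T, alpha, beta, seed=0):
    """Asynchronous TBP sketch; updates phi, theta and lambda in place."""
    rng = random.Random(seed)
    phi = [[0.0] * K for _ in range(W)]
    theta = [[0.0] * D for _ in range(K)]
    lam = [0.0] * K
    row = [0.0] * W    # sum over d of x[w][d]
    col = [0.0] * D    # sum over w of x[w][d]
    total = 0.0        # sum over w and d of x[w][d]
    for w, d, x in entries:
        k = rng.randrange(K)
        phi[w][k] += x
        theta[k][d] += x
        lam[k] += x
        row[w] += x
        col[d] += x
        total += x
    for _ in range(T):
        for w, d, x in entries:
            # Shrink the elements in proportion to this entry's share.
            for k in range(K):
                phi[w][k] *= 1.0 - x / row[w]
                theta[k][d] *= 1.0 - x / col[d]
                lam[k] *= 1.0 - x / total
            # Recompute the message from the reduced matrices.
            eta = [(phi[w][k] + beta) * (theta[k][d] + alpha)
                   / (lam[k] + W * beta) for k in range(K)]
            norm = sum(eta)
            # Compensate with the updated message; later entries see it at once.
            for k in range(K):
                mass = x * eta[k] / norm
                phi[w][k] += mass
                theta[k][d] += mass
                lam[k] += mass
    return phi, theta, lam
```

Because the shrink step removes exactly $x_{w,d}$ of mass from a row of $\phi$, a column of $\theta$ and $\lambda$, and the message adds the same amount back, the per-word, per-document and corpus totals are preserved without any stored messages or temporary matrices.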
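    input  : $\mathbf{x}, K, T, \alpha, \beta$.
    output : $\phi_{W \times K}, \theta_{K \times D}$.
     1  begin
     2    $\phi_{W \times K} \leftarrow 0$, $\theta_{K \times D} \leftarrow 0$, $\lambda_{K} \leftarrow 0$;
     3    // random initialization
     4    for $w \leftarrow 1$ to $W$, $d \leftarrow 1$ to $D$, $x_{w,d} \neq 0$ do
     5      $k \leftarrow \text{rand}(K)$;
     6      $\phi_{w,k} \leftarrow \phi_{w,k} + x_{w,d}$;
     7      $\theta_{k,d} \leftarrow \theta_{k,d} + x_{w,d}$;
     8      $\lambda_{k} \leftarrow \lambda_{k} + x_{w,d}$;
     9    end
    10    // The aTBP algorithm
    11    for $t \leftarrow 1$ to $T$ do
    12      for $d \leftarrow 1$ to $D$, $w \leftarrow 1$ to $W$, $k \leftarrow 1$ to $K$, $x_{w,d} \neq 0$ do
    13        $\phi_{w,k} \leftarrow (1 - x_{w,d} / \sum_d x_{w,d})\,\phi_{w,k}$;
    14        $\theta_{k,d} \leftarrow (1 - x_{w,d} / \sum_w x_{w,d})\,\theta_{k,d}$;
    15        $\lambda_{k} \leftarrow (1 - x_{w,d} / \sum_{w,d} x_{w,d})\,\lambda_{k}$;
    16        $\eta_k \leftarrow (\phi_{w,k} + \beta)(\theta_{k,d} + \alpha) / (\lambda_k + W\beta)$;
    17        $\eta_k \leftarrow \eta_k / \sum_k \eta_k$;
    18        $\phi_{w,k} \leftarrow \phi_{w,k} + x_{w,d}\,\eta_k$;
    19        $\theta_{k,d} \leftarrow \theta_{k,d} + x_{w,d}\,\eta_k$;
    20        $\lambda_{k} \leftarrow \lambda_{k} + x_{w,d}\,\eta_k$;
    21      end
    22    end
    23  end

Figure 3: The aTBP algorithm for LDA.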



The time complexity of TBP is $O(\text{NNZ} \times KT)$, where NNZ is the number of non-zero elements in the document-word matrix, $K$ is the number of topics and $T$ is the number of training iterations. sTBP has the space complexity $O(3 \times \text{NNZ} + 2 \times KW + 2 \times KD)$, while aTBP has the space complexity $O(3 \times \text{NNZ} + KW + KD)$. Generally, we use $3 \times \text{NNZ}$ to store the data in memory, including the indices of non-zero elements in the document-word matrix, and $KW + KD$ memory to store the matrices $\phi$ and $\theta$. Because sTBP uses the additional matrices $\hat{\phi}, \hat{\theta}$ for synchronization, it uses $2 \times KW + 2 \times KD$ for all matrices.



3.2 Data Memory 

When the corpus data is larger than the computer memory, traditional algorithms cannot train LDA due to the memory limitation. We assume that the hard disk is large enough to store the corpus file. Recently, reading data from the hard disk into memory as blocks has emerged as a promising method (Yu et al., 2010) for handling such problems. We can extend the TBP algorithms in Figs. 2 and 3 to read the corpus file as blocks, and optimize each block sequentially. For example, we can read one document at a time from the corpus file into memory and perform the TBP algorithms to refine the matrices $\{\phi, \theta\}$. After scanning all documents in the corpus data file, TBP finishes one iteration of training in Figs. 2 and 3. Similarly, we can also store the matrices $\{\phi, \theta\}$ in a file on the hard disk when they are larger than the computer memory. In such cases, TBP consumes almost no memory to do topic modeling. Because loading data into memory requires additional time, TBP running on files takes around twice as long as TBP running in memory. For example, for the 7GB PUBMED corpus and $K = 10$, aTBP requires 259.64 seconds to scan the whole data file on the hard disk, while it requires only 128.50 seconds to scan the entire data in memory at each training iteration. Another choice is to extend TBP to online learning (Hoffman et al., 2010), which partitions the whole corpus file into mini-batches and optimizes each mini-batch sequentially after one look. Although some online topic modeling algorithms like OVB can converge to the objective of the corresponding batch topic modeling algorithm, we find that the best topic modeling accuracy depends on several heuristic parameters, including the mini-batch size (Hoffman et al., 2010). In contrast, TBP is a batch learning algorithm that can handle large data memory with better topic modeling accuracy. Reading block data from hard disk to memory can also be applied to both the GS and VB algorithms for LDA.
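The block-reading idea can be illustrated with a short Python sketch that streams one document at a time from disk. Everything concrete here is an assumption made for the example: the file format (one document per line, space-separated `w:x` pairs), the function names `iter_documents` and `one_iteration`, and the `update` callback, which stands in for whatever per-entry TBP update is being applied.

```python
def iter_documents(path):
    """Yield (d, [(w, x), ...]) for one document at a time.

    Hypothetical format: line d of the file holds space-separated
    "w:x" pairs, one pair per non-zero element of document d.
    """
    with open(path, "r", encoding="utf-8") as corpus:
        for d, line in enumerate(corpus):
            pairs = [token.split(":") for token in line.split()]
            yield d, [(int(w), float(x)) for w, x in pairs]


def one_iteration(path, update):
    """Scan the corpus file once, keeping a single document in memory.

    update(w, d, x) is called for every non-zero element, for example
    the per-entry step of an asynchronous TBP sweep. Returns the number
    of non-zero elements visited.
    """
    visited = 0
    for d, document in iter_documents(path):
        for w, x in document:
            update(w, d, x)
            visited += 1
    return visited
```

Calling `one_iteration` $T$ times corresponds to $T$ training iterations, with the data memory bounded by the longest single document instead of the whole corpus.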



3.3 Relationship to Previous Algorithms 

The proposed TBP connects the training algorithm of LDA to the NMF algorithm with
KL divergence. The intrinsic relation between probabilistic topic models (Hofmann, 2001;
Blei et al., 2003) and NMF (Lee and Seung, 2001) has been extensively discussed in several
previous works (Buntine, 2002; Gaussier and Goutte, 2005; Girolami and Kaban, 2003;
Wahabzada and Kersting, 2011; Wahabzada et al., 2011; Zeng et al., 2011). A more recent
work shows that topic models can be learned by NMF in polynomial time (Arora et al.,
2012). Generally speaking, learning topic models can be formulated within the message
passing framework in Section 2 based on the generalized expectation-maximization
(EM) (Dempster et al., 1977) algorithm. The objective is to maximize the joint distribution
of PLSA or LDA in two iterative steps. At the E-step, we approximately infer the
marginal distribution of the topic label assigned to a word, called the message. At the M-step,
based on the normalized messages, we estimate two multinomial parameters according to
the maximum-likelihood criterion. The EM algorithm iterates until it converges to a local
optimum. On the other hand, the NMF algorithm with KL divergence has a probabilistic
interpretation (Lee and Seung, 2001), which views the product of the two factorized matrices
as a normalized probability distribution. Notice that the widely-used performance
measure perplexity (Blei et al., 2003; Asuncion et al., 2009; Hoffman et al., 2010) for topic
models follows exactly the same KL divergence in NMF, which implies that the NMF
algorithm may achieve a lower perplexity in learning topic models. Therefore, connecting
NMF with LDA may inspire more efficient algorithms to learn topic models. For example,
in this paper, we show that the proposed TBP can avoid storing messages to reduce the
memory usage. More generally, we speculate that finite mixture models and latent factor
models (Gershman and Blei, 2012) may share similar learning techniques, so that training
algorithms developed for one family may inspire more efficient ones for the other in the
near future.
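The link between perplexity and the KL divergence of NMF can be made concrete with a short sketch. The notation here is an assumption made for illustration: x_{w,d} is the word count, n_d = \sum_w x_{w,d} is the document length, and \hat{x}_{w,d} = n_d \sum_k \theta_d(k)\phi_w(k) is the reconstructed count, with normalized parameters (\sum_k \theta_d(k) = 1 and \sum_w \phi_w(k) = 1), so that \sum_{w,d} \hat{x}_{w,d} = \sum_{w,d} x_{w,d}.

```latex
\begin{aligned}
D(x \,\|\, \hat{x})
  &= \sum_{w,d} \Big[ x_{w,d} \log \frac{x_{w,d}}{\hat{x}_{w,d}}
       - x_{w,d} + \hat{x}_{w,d} \Big] \\
  &= \underbrace{\sum_{w,d} x_{w,d} \log \frac{x_{w,d}}{n_d}}_{\text{constant in } \theta,\ \phi}
     \;-\; \sum_{w,d} x_{w,d} \log \sum_k \theta_d(k)\,\phi_w(k), \\[4pt]
\log(\text{perplexity})
  &= -\,\frac{\sum_{w,d} x_{w,d} \log \sum_k \theta_d(k)\,\phi_w(k)}
            {\sum_{w,d} x_{w,d}} .
\end{aligned}
```

Under these assumptions the two objectives differ only by an additive constant and a positive scale factor, so parameters that minimize the KL divergence also minimize the perplexity.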



4. Experimental Results 

Our experiments aim to confirm the lower memory usage of TBP compared with the
state-of-the-art batch learning algorithms such as VB (Blei et al., 2003), GS (Griffiths and
Steyvers, 2004) and BP (Zeng et al., 2011), as well as online learning algorithms such as
online VB (OVB) (Hoffman et al., 2010). We use four publicly available document data
sets (Porteous et al., 2008; Hoffman et al., 2010): ENRON, NYTIMES, PUBMED and
WIKI. Previous studies (Porteous et al., 2008) revealed that the topic modeling result is
relatively insensitive to the total number of documents in the corpus. Because of the
memory limitation for GS, BP and VB algorithms, we randomly select 15000 documents from
the original NYTIMES data set, 80000 documents from the original PUBMED data set,
and 10000 documents from the original WIKI data set for experiments. Table 1 summarizes
the statistics of the four data sets, where D is the total number of documents in the corpus,
W is the number of words in the vocabulary, Nd is the average number of word tokens per
document, and Wd is the average number of distinct word indices per document.

Table 1: Statistics of four document data sets.

Data sets    D        W        Nd        Wd
ENRON        39861    28102    160.9     93.1
NYTIMES      15000    84258    328.7     230.2
PUBMED       80000    76878    68.4      46.7
WIKI         10000    77896    1013.3    447.2
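The statistics D, W, Nd and Wd can be computed from a sparse document-word count matrix as in the following sketch. The function name corpus_statistics and the (doc_id, word_id, count) input format are assumptions for illustration, and W is counted here as the number of distinct words that actually occur, which may be smaller than a fixed vocabulary size.

```python
from collections import defaultdict


def corpus_statistics(triples):
    """Compute D, W, Nd and Wd from (doc_id, word_id, count) triples."""
    tokens_per_doc = defaultdict(int)   # total word tokens in each document
    words_per_doc = defaultdict(set)    # distinct word indices in each document
    vocabulary = set()
    for doc_id, word_id, count in triples:
        if count <= 0:
            continue  # zero entries carry no tokens
        tokens_per_doc[doc_id] += count
        words_per_doc[doc_id].add(word_id)
        vocabulary.add(word_id)
    num_docs = len(tokens_per_doc)
    if num_docs == 0:
        raise ValueError("corpus contains no word tokens")
    return {
        "D": num_docs,
        "W": len(vocabulary),
        "Nd": sum(tokens_per_doc.values()) / num_docs,
        "Wd": sum(len(s) for s in words_per_doc.values()) / num_docs,
    }


# Toy corpus: document 0 has 3 tokens over 2 words, document 1 has 3 tokens of 1 word.
toy_stats = corpus_statistics([(0, 0, 2), (0, 1, 1), (1, 1, 3)])
```

The gap between Nd and Wd indicates how often words repeat within a document, and Wd (not Nd) determines how many nonzero matrix entries each document contributes.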

We randomly partition each data set into halves, one for the training set and the other
for the test set. The training perplexity (24) is calculated on the training set over 500
iterations. Usually, the training perplexity decreases as the number of training iterations
increases. We regard the algorithm as converged when the change in training perplexity
between successive iterations is less than a predefined threshold. In our experiments, we set
the threshold to one because the decrease in training perplexity is very small after this
threshold is satisfied.
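The stopping rule described above can be sketched in a few lines. The function name iterations_to_converge is hypothetical, and the perplexity values in the example are made-up numbers used only to exercise the rule.

```python
def iterations_to_converge(perplexities, threshold=1.0):
    """Return the 1-based iteration at which the change in training
    perplexity between successive iterations first drops below threshold,
    or None if that never happens in the recorded trace."""
    for t in range(1, len(perplexities)):
        if abs(perplexities[t - 1] - perplexities[t]) < threshold:
            return t + 1  # perplexities[t] belongs to iteration t + 1
    return None


# Illustrative (made-up) perplexity trace over five iterations.
example_trace = [1000.0, 900.0, 850.0, 849.5, 849.4]
converged_at = iterations_to_converge(example_trace)
```

With a threshold of one, the trace above is considered converged at the fourth iteration, where the change is 0.5.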



The predictive perplexity for the unseen test set is computed as follows (Asuncion et al.,
2009). On the training set, we estimate φ from the same random initialization after 500
iterations. For the test set, we randomly partition each document into 80% and 20% subsets.
Fixing φ, we estimate θ on the 80% subset by the training algorithms from the same random
initialization after 500 iterations, and then calculate the predictive perplexity on the
remaining 20% subset,

$$\text{predictive perplexity} = \exp\left\{ -\frac{\sum_{w,d} x_{w,d}^{20\%} \log\left[\sum_k \theta_d(k)\,\phi_w(k)\right]}{\sum_{w,d} x_{w,d}^{20\%}} \right\}, \qquad (25)$$

where x^{20%} denotes the word counts in the 20% subset. A lower predictive perplexity
represents a better generalization ability.
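The held-out evaluation can be sketched as below, assuming the standard form of perplexity as the exponential of the negative average log-likelihood per held-out token. The function name, the nested-list layout theta[d][k] and phi[k][w], and the toy numbers are all assumptions for illustration.

```python
import math


def predictive_perplexity(heldout, theta, phi):
    """Perplexity of held-out counts under fixed theta and phi.

    heldout: iterable of (doc_id, word_id, count) from the 20% subset
    theta:   theta[d][k], topic proportions of document d (sums to 1 over k)
    phi:     phi[k][w], word distribution of topic k (sums to 1 over w)
    """
    log_likelihood = 0.0
    total_tokens = 0
    num_topics = len(phi)
    for d, w, count in heldout:
        # Word probability is the mixture over topics.
        p = sum(theta[d][k] * phi[k][w] for k in range(num_topics))
        log_likelihood += count * math.log(p)
        total_tokens += count
    return math.exp(-log_likelihood / total_tokens)


# Sanity check: uniform topics over 4 words give a perplexity of 4.
uniform_phi = [[0.25] * 4, [0.25] * 4]
toy_theta = [[0.5, 0.5]]
toy_ppl = predictive_perplexity([(0, 0, 3), (0, 2, 1)], toy_theta, uniform_phi)
```

The uniform case is a useful check because a model that spreads probability evenly over W words should always score a perplexity of W.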



4.1 Comparison with Batch Learning Algorithms

We compare TBP with other batch learning algorithms such as GS, BP and VB. For all
data sets, we fix the same hyperparameters as α = 2/K, β = 0.01 (Porteous et al., 2008).
The CPU time per iteration is measured after sweeping the entire data set. We report the
average CPU time per iteration after T = 500 iterations, which practically ensures that
GS, BP and VB converge in terms of training perplexity. For a fair comparison, we use the
same random initialization for all algorithms and run each for 500 iterations. So that our
experiments can be repeated, we have made all source codes and data sets publicly
available (Zeng, 2012). These experiments are run on a Sun Fire X4270 M2 server with two
6-core 3.46 GHz CPUs and 128 GB of RAM.

Table 2: Message memory (MBytes) for training set when K = 100.

Inference methods    ENRON     NYTIMES    PUBMED    WIKI
VB and BP            1433.6    1323.1     1425.1    1705.0
GS                   12.6      9.5        10.4      19.3
aTBP and sTBP

Figure 4: Predictive perplexity as K = {100,300,500,700,900} on ENRON, NYTIMES,
PUBMED and WIKI data sets. The notations 0.8x and 0.4x denote that the predictive
perplexity is multiplied by 0.8 and 0.4, respectively.

Table 2 compares the message memory usage during training. VB and BP consume
more than 1GB of memory for message passing when K = 100. VB and BP even require more
than 9GB for message passing when K = 900, because their message memory increases
linearly with the number of topics K in Eq. (7). In contrast, GS needs only 10 ~ 20MB
of memory for message passing. The advantage of GS is that its memory occupancy does not
depend on the number of topics K in Eq. (3). Therefore, PGS (Newman et al., 2009) can
handle relatively large-scale data sets containing thousands of topics without memory
problems using a parallel architecture. However, PGS still requires message memory for
message passing at each distributed computing unit. By comparison, neither aTBP nor sTBP
needs memory space to store previous messages, and thus both save a large amount of
memory. This is a significant improvement, especially compared with the VB and BP
algorithms. In conclusion, TBP is our first choice for batch topic modeling of massive
corpora containing a large number of topics when memory is limited.
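The linear growth of message memory with K can be illustrated with a back-of-the-envelope estimate. The sketch below assumes one 8-byte floating-point value per nonzero document-word entry per topic, and assumes that half of each data set is used for training; neither detail is stated explicitly in the text, and the function name message_memory_mb is hypothetical.

```python
def message_memory_mb(num_docs, avg_distinct_words, num_topics, bytes_per_value=8):
    """Estimate the memory (in MBytes, 2**20 bytes) needed to store one
    K-dimensional message for every nonzero document-word entry."""
    nonzero_entries = num_docs * avg_distinct_words
    return nonzero_entries * num_topics * bytes_per_value / 2**20


# PUBMED subset: 80000 documents, half used for training, Wd = 46.7.
pubmed_k100 = message_memory_mb(40000, 46.7, 100)
pubmed_k900 = message_memory_mb(40000, 46.7, 900)
```

Under these assumptions the PUBMED estimate for K = 100 is roughly in line with the VB and BP row of Table 2, and multiplying K by nine multiplies the estimate by nine, which is the linear dependence on K that TBP avoids by not storing messages.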

Fig. 4 shows the topic modeling accuracy measured by the predictive perplexity on the
unseen test set. A lower predictive perplexity implies a better topic modeling performance.
VB performs the worst among all batch learning algorithms, with the
highest predictive perplexity. For a better illustration, we multiply VB's perplexity by 0.8
on ENRON and NYTIMES, and by 0.4 on PUBMED data sets, respectively. Also, we find
that VB shows an overfitting phenomenon, where the predictive perplexity increases with
the number of topics K on all data sets. The basic reason is that VB optimizes an
approximate variational distribution that has a gap to the true distribution. When
the number of topics is large, this gap cannot be ignored, leading to serious biases. We
see that GS performs much better than VB on all data sets, because it theoretically
approximates the true distribution by sampling techniques. BP always achieves a much lower
predictive perplexity than GS, because it retains all uncertainty of messages without
sampling. Both sTBP and aTBP perform equally well on ENRON and PUBMED data sets,
where they also achieve the lowest predictive perplexity among all batch training algorithms.
However, BP outperforms both sTBP and aTBP on NYTIMES and WIKI data sets. Also,
aTBP outperforms both sTBP and GS, while sTBP performs slightly worse than GS. Because
aTBP has consistently better topic modeling accuracy than GS on all data sets, we
advocate aTBP for topic modeling in limited memory. As we discussed in Section 3.3,
BP/TBP has the lowest predictive perplexity mainly because it directly minimizes the KL
divergence between x and φθ from the NMF perspective.

Figure 5: CPU time per iteration (seconds) as K = {100,300,500,700,900} on ENRON,
NYTIMES, PUBMED and WIKI data sets. The notation 0.3x denotes that the training
time is multiplied by 0.3.

Fig. 5 shows the CPU time per iteration of all algorithms. All these algorithms have a
time complexity linear in K. VB is the most time-consuming because it involves complicated
digamma function computation (Asuncion et al., 2009; Zeng et al., 2011). For a better
illustration, we multiply VB's training time by 0.3. Although BP runs faster than GS
when K is small (K < 100) (Zeng et al., 2011), it is sometimes slower than GS when K is
large (K > 100), especially on ENRON and PUBMED data sets. The reason is that
GS often randomly samples a topic label without visiting all K topics, while BP requires
searching all K topics for the message update. When K is very large, this slight difference
is amplified. sTBP runs as fast as BP in most cases, but aTBP runs slightly slower than
both sTBP and BP. Comparing the two algorithms in Figs. 2 and 3, we find that aTBP uses
more division operations than sTBP at each training iteration, which accounts for aTBP's
slowness. In summary, TBP has a topic modeling speed comparable to that of GS and BP but
with reduced memory usage.

Fig. 6 shows the training perplexity as a function of training iterations. All algorithms
converge to a fixed point given enough training iterations. On all data sets, VB usually uses
110 ~ 170 iterations, GS uses around 400 ~ 470 iterations, and BP/TBP uses 180 ~ 230
iterations for convergence. Although the digamma function calculation is slow, it reduces
the number of training iterations VB needs to reach convergence. GS is a stochastic sampling
method, and thus requires more iterations to approximate the true distribution. Because
BP/TBP is a deterministic message passing method, it needs fewer iterations to achieve
convergence than GS. Overall, BP/TBP consumes the least training time until convergence
according to Figs. 5 and 6.

Figure 6: Training perplexity as a function of the number of iterations when K = 500 on
ENRON, NYTIMES, PUBMED and WIKI data sets.



Fig. 7 shows the top ten words of K = 10 topics extracted by VB (red), GS (blue),
BP (black), aTBP (green) and sTBP (magenta). We see that most topics contain similar
top ten words but in a different order. More formally, we can adopt subjective measures
such as the word intrusion in topics and the topic intrusion in documents (Chang et al.,
2009) to evaluate the extracted topics. PUBMED is a biomedical corpus. According to our
prior knowledge of the biomedical domain, we find that these topics are all meaningful.
Under this condition, we advocate TBP for topic modeling with reduced memory requirements.

4.2 Comparison with Online Algorithms 

We compare the topic modeling performance of TBP and the state-of-the-art online
topic modeling algorithm OVB (Hoffman et al., 2010) on a desktop computer with 2GB of
memory. The complete 7GB PUBMED data set (Porteous et al., 2008) contains a total
of D = 8,200,000 documents with a vocabulary size W = 141,043. Currently, only
TBP and online topic modeling methods can handle a 7GB data set using 2GB of memory.
OVB (Hoffman et al., 2010) uses the following default parameters: κ = 0.5, τ0 = 1024, and
the mini-batch size S = 1024. We randomly reserve 40,000 documents as the test set, and
use the remaining 8,160,000 documents as the training set. The number of topics is K = 10.
The hyperparameters are α = 2/K = 0.05 and β = 0.01.
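To show what the OVB parameters control, the sketch below evaluates the step-size schedule ρ_t = (τ0 + t)^(−κ) used by online VB (Hoffman et al., 2010) and counts the mini-batches in one pass over the training set. The function names are hypothetical, and the code illustrates only the schedule, not the OVB update itself.

```python
import math


def ovb_step_size(t, tau0=1024.0, kappa=0.5):
    """Step size rho_t = (tau0 + t) ** (-kappa) for mini-batch index t = 0, 1, ...

    A larger tau0 down-weights early mini-batches; kappa controls how
    quickly the influence of new mini-batches decays.
    """
    return (tau0 + t) ** (-kappa)


def num_minibatches(num_train_docs, batch_size):
    """Number of mini-batches needed for one pass (last one may be partial)."""
    return math.ceil(num_train_docs / batch_size)


batches_per_pass = num_minibatches(8_160_000, 1024)
first_step = ovb_step_size(0)
last_step = ovb_step_size(batches_per_pass - 1)
```

With these settings the step size shrinks monotonically over the pass, so each later mini-batch changes the global topic estimate less than the earlier ones, which is one reason accuracy is sensitive to these heuristic parameters.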

Fig. 8 shows the predictive perplexity as a function of training time (seconds in log scale).
OVB converges more slowly than TBP because it reads input data as a data stream, discarding
each mini-batch sequentially after one look. Notice that, for each mini-batch, OVB still
requires allocating message memory for computation. In contrast, TBP achieves a much
lower perplexity using less memory and training time. There are two major reasons.
First, TBP directly optimizes the perplexity in terms of the KL divergence in Eq. (24), so that
it can achieve a much lower perplexity than OVB. Second, OVB involves computationally
expensive digamma functions, which significantly slow it down. We see that sTBP is
a bit faster than aTBP because it does not perform the division operation at each iteration
(see Figs. 2 and 3). Because aTBP influences matrix factorization immediately after the
matrix update, it converges to a slightly lower perplexity than sTBP.



http://www.cs.princeton.edu/~blei/topicmodeling.html

Topic 1
cell mice antigen lymphocytes normal effect tumor human activity strain
cell antigen strain virus mice infection antibody antibodies serum test
cell antigen strain virus mice infection antibody antibodies serum test
cell antigen strain mice virus antibody infection antibodies serum human
cell antigen mice strain infection virus antibody serum antibodies human

Topic 2
patient treatment disease clinical therapy tumor diagnosis children year cancer
patient disease treatment clinical tumor lesion therapy diagnosis syndrome carcinoma
patient tumor lesion treatment clinical surgery diagnosis disease operation carcinoma
patient disease treatment clinical tumor lesion diagnosis therapy syndrome complication
patient disease treatment clinical lesion therapy diagnosis tumor syndrome chronic

Topic 3
rat level effect day plasma concentration activity normal cell hormone
level rat serum day plasma normal concentration liver effect animal
level rat concentration plasma serum liver blood normal effect glucose
rat level plasma effect serum concentration day normal liver animal
level rat plasma effect day serum concentration normal administration animal

Topic 4
effect concentration rat activity level transport glucose sodium acid muscle
muscle activity nerve stimulation response neuron effect cat responses potential
pressure muscle heart dog cardiac nerve activity response effect stimulation
effect activity response muscle brain stimulation responses rat drug nerve
effect response muscle brain activity stimulation responses nerve rat drug

Topic 5
children patient data problem population test effect age factor program
children test patient data problem subject population factor program hospital
patient disease children age syndrome infant clinical normal year incidence
children patient age infant subject population problem test data factor
children test subject problem data population age factor program hospital

Topic 6
patient pressure lesion normal clinical disease pulmonary coronary cardiac dog
patient pressure heart cardiac pulmonary dog normal left coronary arterial
test system data problem patient subject program new analysis clinical
patient dog pressure renal heart cardiac normal blood pulmonary lung
pressure patient dog cardiac heart pulmonary normal left right flow

Topic 7
effect neuron cell activity response stimulation rat cat responses brain
effect rat concentration drug level administration action insulin dose glucose
effect rat drug level administration activity response treatment action dose
activity concentration enzyme effect acid inhibition ph transport cell uptake
activity acid concentration effect liver enzyme rat uptake inhibition transport

Topic 8
concentration effect measurement system acid drug water solution determination values
acid concentration ph solution temperature effect reaction compound degrees*c water
concentration effect ph transport solution temperature membrane degrees*c rate water
measurement system determination solution model water temperature analysis technique concentration
solution temperature concentration measurement determination ph water model reaction degrees*c

Topic 9
protein cell acid activity enzyme dna rna fraction ph isolated
protein activity enzyme dna fraction acid binding rna synthesis enzymes
activity protein acid enzyme fraction enzymes amino*acid binding synthesis isolated
protein dna acid rna fraction mutant isolated molecular*weight synthesis chain
protein dna binding fraction rna enzyme mutant activity acid molecular*weight

Topic 10
patient serum test level antigen antibody antibodies infection normal effect
cell tissue number normal surface human membrane chromosome electron*microscopy formation
cell dna rna mutant number region growth nuclei normal structure
cell tissue number normal surface layer large development membrane section
antigen infection strain virus antibody test serum antibodies mice human

Figure 7: Top ten words in ten topics extracted from the subset of PUBMED: VB (red),
GS (blue), BP (black), aTBP (green) and sTBP (magenta). Most topics contain
similar words but in a different order. The subjective measures (Chang et al.,
2009) such as word intrusions in topics and topic intrusions in documents are
comparable among different algorithms.

Figure 8: Predictive perplexity obtained on the complete PUBMED corpus as a function
of CPU time (seconds in log scale) when K = 10.



5. Conclusions 



This paper has presented a novel tiny belief propagation (TBP) algorithm for training LDA
with significantly reduced memory requirements. The TBP algorithm reduces the message
memory required by conventional message passing algorithms including GS, BP and VB.
We also discuss the intrinsic relation between the proposed TBP and NMF (Lee and Seung,
2001) with KL divergence. We find that TBP can be approximately viewed as a special
NMF algorithm for minimizing the perplexity metric, which is a widely-used evaluation
method for different training algorithms of LDA. In addition, we confirm the superior topic
modeling accuracy of TBP in terms of predictive perplexity in extensive experiments. For
example, when compared with the state-of-the-art online topic modeling algorithm OVB,
the proposed TBP is faster and more accurate in extracting 10 topics from the 7GB PUBMED
corpus using a desktop computer with 2GB of memory.

Recently, the NMF algorithm has been advocated for learning topic models such as LDA
in polynomial time (Arora et al., 2012). The proposed TBP algorithm also suggests
that NMF algorithms can be applied to training topic models like LDA with high
accuracy in terms of the perplexity metric. We hope that our results may inspire more
NMF algorithms (Lee and Seung, 2001) to be extended to learn other complicated
LDA-based topic models (Blei, 2012) in the near future.